Cut Document (Text Processing)

Synopsis

Cuts an input document into segments using regular expressions specifying the start and end of each segment.

Description

This operator segments a text based on a starting and ending regular expression.
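
As a rough illustration of this idea outside the operator, the following Python sketch cuts a string into segments between a start and an end regular expression. The function name cut_document and the choice to exclude the matched delimiters from each segment are assumptions made for the example; they are not taken from the operator's implementation.

```python
import re


def cut_document(text, start_pattern, end_pattern):
    """Return the text between each start match and the next end match.

    Hypothetical helper: the delimiters themselves are excluded,
    which is an assumption of this sketch.
    """
    start_re = re.compile(start_pattern)
    end_re = re.compile(end_pattern)
    segments = []
    pos = 0
    while pos <= len(text):
        start = start_re.search(text, pos)
        if start is None:
            break
        end = end_re.search(text, start.end())
        if end is None:
            break
        segments.append(text[start.end():end.start()])
        # Always move forward, even if both patterns matched empty strings.
        pos = end.end() if end.end() > pos else pos + 1
    return segments


text = "<item>first</item><item>second</item>"
print(cut_document(text, r"<item>", r"</item>"))  # ['first', 'second']
```

Each returned string corresponds to one entry of the output collection; a start match without a following end match produces no segment in this sketch.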

Input

document

Output

documents (Collection)
Collection of the segments cut from the input document.

Parameters

query type: Specifies the type of the query. The available query types are: String Matching, Regular Expression, Regular Region, Indexed, XPath and JSONPath. Range: selection
string matching queries: Specifies a list of string matching start and end sequences. Everything between them is used as the result. See the operator documentation for details on string matching. Range: list
attribute type: Specifies the type of the resulting attributes. If numerical or binominal is chosen, ensure that the returned result can be interpreted as that type. The available types are: Nominal, Numerical and Binominal. Range: selection
regular expression queries: Specifies a list of attribute names and their corresponding regular expressions. The first matching group is used as the value. See the operator documentation for details on regular expressions. Range: list
regular region queries: Specifies a list of attribute names and their corresponding regular expressions. Two regular expressions can be specified to define the start and the end of a region. Everything between the two matches is delivered as the result. Range: list
xpath queries: Specifies a list of attribute names and their corresponding XPath queries. See the operator documentation for details on XPath. Range: list
namespaces: Specifies pairs of identifier and namespace for use in XPath queries. The namespace for (X)HTML is bound automatically to the identifier h. Range: list
ignore CDATA: Indicates whether CDATA should be ignored when evaluating XPath expressions. Range: boolean
assume html: If checked, a more tolerant XML parser is used, which copes with forbidden HTML constructions but always assumes HTML and adds missing tags. Uncheck this for plain XML. Range: boolean
index queries: Specifies a list of attribute names and their regions. Each region is specified as the offset index and length of the match. Range: list
jsonpath queries: Specifies a list of attribute names and their corresponding JSONPath queries. Range: list
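
To make three of the query types above more concrete, here is a small Python sketch of what a regular expression query (first matching group), an index query (offset and length) and a string matching query (start and end sequences) might compute. The function names and the sample text are hypothetical, and the exact matching rules of the operator may differ; this only mirrors the parameter descriptions.

```python
import re


def regex_query(text, pattern):
    """Value of the first capturing group of the first match, or None.

    The pattern is expected to contain at least one capturing group.
    """
    match = re.search(pattern, text)
    return match.group(1) if match else None


def index_query(text, offset, length):
    """Region given by an offset index and a length."""
    return text[offset:offset + length]


def string_matching_query(text, start, end):
    """Everything between the first start sequence and the next end sequence."""
    begin = text.find(start)
    if begin == -1:
        return None
    begin += len(start)
    stop = text.find(end, begin)
    if stop == -1:
        return None
    return text[begin:stop]


doc = "Order 4711 shipped to [Berlin] on 2024-01-15"
print(regex_query(doc, r"Order (\d+)"))       # 4711
print(index_query(doc, 0, 5))                 # Order
print(string_matching_query(doc, "[", "]"))   # Berlin
```

All three helpers return strings, so a value such as 4711 would still need to be interpretable as a number if the attribute type is set to Numerical.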
